May 15, 2015

Curiosity

Leads to search

Leads to discovery!

Leads to trapped data?

Leads to faceplant?

Data is messy!

  • In statistical modeling, we typically assume data is tidy.
  • That is, data appears in a tabular form where
    • 1 row == 1 observation
    • 1 column == 1 variable (observational attribute)
  • In practice, data hardly ever begins in this way (especially on the web).
  • Some have estimated 80% of the data science workflow is wrangling messy data (and 20% on actual analysis). Yet, these skills are not typically taught in the classroom.
  • We have to do better and stop avoiding the issue.
  • With the right tools, we can do better.
  • It helps if you know a bit about the web. If you don't, just remember to

Motivating Example

Inspecting elements

Hover to find desired elements

rvest makes scraping HTML pages super easy.

library(rvest)
# First, grab the page source
html("http://en.wikipedia.org/wiki/Table_(information)") %>%
  # then extract the first node with class of wikitable
  html_node(".wikitable") %>% 
  # then convert the HTML table into a data frame
  html_table()
##   First name   Last name Age
## 1     Bielat    Adamczak  24
## 2  Blaszczyk Kostrzewski  25
## 3 Olatunkboh    Chijiaku  22
## 4   Adrienne    Anthoula  22
## 5     Axelia  Athanasios  22
## 6  Jon-Kabat        Zinn  22
  • Note: html_table() only works on 'nicely' formatted HTML tables.

This is a nice format? Really? Yes, really. It's the format used to render tables on webpages.

<table class="wikitable">
  <tr>
    <th>First name</th>
    <th>Last name</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Bielat</td>
    <td>Adamczak</td>
    <td>24</td>
  </tr>
  <tr>
    <td>Blaszczyk</td>
    <td>Kostrzewski</td>
    <td>25</td>
  </tr>
  <tr>
    <td>Olatunkboh</td>
    <td>Chijiaku</td>
    <td>22</td>
  </tr>
</table> 

What if data isn't packaged neatly in <table>?

(selectorgadget + rvest) to the rescue!

Extracting links to reports

Your turn

Your turn solution

html("http://www.stat.iastate.edu/people/faculty/") %>%
  html_nodes("#content a") %>% html_attr(name="href") -> hrefs
head(hrefs)
## [1] "http://www.stat.iastate.edu/personal/?id=dcadams" 
## [2] "http://www.stat.iastate.edu/personal/?id=emilyb"  
## [3] "http://www.stat.iastate.edu/personal/?id=pcaragea"
## [4] "http://www.stat.iastate.edu/personal/?id=alicia"  
## [5] "http://www.stat.iastate.edu/personal/?id=songchen"
## [6] "http://www.stat.iastate.edu/personal/?id=dicook"

What about dynamic web pages?

Web APIs

  • Many popular dynamic sites have Web APIs (for example, Facebook, Twitter, YouTube, GitHub, etc). Use them!
  • A Web API is just a set of rules/standards for interacting with a Web server.
  • APIs typically require knowledge of HTTP, but a single HTTP verb (GET) is usually what you want.
  • For example,
library(httr)
me <- GET("https://api.github.com/users/cpsievert")
content(me)[c("name", "company")]
## $name
## [1] "Carson"
## 
## $company
## [1] "Iowa State University"

Accessing the DOM

Useful if you need access to the DOM, not just the page source.

library(XML)
library(RSelenium)
remDr <- remoteDriver(browserName = 'firefox')
remDr$open()
remDr$navigate("http://www.cbssports.com/collegebasketball/gametracker/live/NCAAB_20140407_UK@UCONN")

src <- remDr$getPageSource()
doc <- htmlParse(src, asText = TRUE)
nodes <- getNodeSet(doc, path = "//div[@class='court']")

kids <- sapply(nodes, xmlChildren)
recs <- sapply(kids, xmlAttrs)
head(t(recs))

pJS$stop()
  • We use RSelenium to test whether animint interactive plots behave correctly.
  • I've also used it to scrape websites that have to be rendered in a browser in order to access certain info.

Common data exchange formats

  • So far we've briefly covered how to extract information from HTML pages.
  • HTML is great for sharing content between people, but it isn't great for exchanging data between machines.
  • There are a ton of different ways to exchange data over the web, but by far the most popular ones are XML and JSON.

What is XML?

XML is a markup language that looks very similar to HTML.

<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25"> Wario Bike </vehicle>
    <vehicle speed="40" weight="67"> Piranha Prowler </vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29"> Royal Racer </vehicle>
    <vehicle speed="50" weight="34"> Wild Wing </vehicle>
  </driver>
</mariokart>
  • This example shows that XML can (and is) used to store inherently tabular data (thanks Jeroen Ooms for the fun example)
  • What is are the observational units here? How many observations in total?
  • Two units and 6 total observations (4 vehicles and 2 drivers).

XML2R is a framework to simplify acquistion of tabular/relational XML.

library(XML2R)
obs <- XML2Obs("http://bit.ly/mario-xml")
table(names(obs))
## 
##          mariokart//driver mariokart//driver//vehicle 
##                          2                          4
  • The main idea of XML2R is to represent the nested list-like structure of XML as a flat list of observations.
  • The list names track the "observational unit".
  • The list values track the "observational attributes".

obs # named list of matrices. Each matrix is *guaranteed* to have 1 row.
## $`mariokart//driver//vehicle`
##      speed weight XML_value     
## [1,] "55"  "25"   " Wario Bike "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "40"  "67"   " Piranha Prowler "
## 
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value      
## [1,] "54"  "29"   " Royal Racer "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value    
## [1,] "50"  "34"   " Wild Wing "
## 
## $`mariokart//driver`
##      name    occupation
## [1,] "Peach" "Princess"

collapse_obs(obs) # groups observations by their name/unit
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "55"  "25"   " Wario Bike "     
## [2,] "40"  "67"   " Piranha Prowler "
## [3,] "54"  "29"   " Royal Racer "    
## [4,] "50"  "34"   " Wild Wing "
  • What information have I lost by combining observations of the same unit into the same table?
  • I can't map vehicles to the drivers!

obs <- add_key(obs, parent = "mariokart//driver", recycle = "name")
collapse_obs(obs)
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value           name    
## [1,] "55"  "25"   " Wario Bike "      "Bowser"
## [2,] "40"  "67"   " Piranha Prowler " "Bowser"
## [3,] "54"  "29"   " Royal Racer "     "Peach" 
## [4,] "50"  "34"   " Wild Wing "       "Peach"

Now (if I want) I can merge the tables into a single table…

tabs <- collapse_obs(obs)
merge(tabs[[1]], tabs[[2]], by = "name")
##     name occupation speed weight         XML_value
## 1 Bowser      Koopa    55     25       Wario Bike 
## 2 Bowser      Koopa    40     67  Piranha Prowler 
## 3  Peach   Princess    54     29      Royal Racer 
## 4  Peach   Princess    50     34        Wild Wing

Your turn

Your turn 'solution'

Using add_key() is optional here, but it removes the need to hang on to the "game" table.

library(magrittr)
XML2Obs("http://gd2.mlb.com/components/game/mlb/year_2011/month_04/day_04/gid_2011_04_04_minmlb_nyamlb_1/players.xml") %>% 
  add_key(parent = "game", recycle = "venue") %>% 
  add_key(parent = "game", recycle = "date") %>%
  collapse_obs -> tabs
tabs[["game//team//player"]][1:5, c("first", "last", "venue", "date")]
##      first     last      venue            date           
## [1,] "Jim"     "Thome"   "Yankee Stadium" "April 4, 2011"
## [2,] "Carl"    "Pavano"  "Yankee Stadium" "April 4, 2011"
## [3,] "Michael" "Cuddyer" "Yankee Stadium" "April 4, 2011"
## [4,] "Joe"     "Nathan"  "Yankee Stadium" "April 4, 2011"
## [5,] "Joe"     "Mauer"   "Yankee Stadium" "April 4, 2011"

What about JSON?

  • JavaScript Object Notation (JSON) is comprised of two components:
    1. arrays => [value1, value2]
    2. objects => {"key1": value1, "key2": [value2, value3]}
  • NOTE: you can also have arrays of objects!
  • The preferred R package for R <=> JSON conversion has long been RJSONIO
  • However, jsonlite is gaining a lot of momentum/attention.
  • In fact, shiny will soon be moving from RJSONIO to jsonlite

Package downloads from RStudio's CRAN mirror

Back to Mariokart

[
    {
        "driver": "Bowser",
        "occupation": "Koopa",
        "vehicles": [
            {
                "model": "Wario Bike",
                "speed": 55,
                "weight": 25
            },
            {
                "model": "Piranha Prowler",
                "speed": 40,
                "weight": 67
            }
        ]
    },
    {
        "driver": "Peach",
        "occupation": "Princess",
        "vehicles": [
            {
                "model": "Royal Racer",
                "speed": 54,
                "weight": 29
            },
            {
                "model": "Wild Wing",
                "speed": 50,
                "weight": 34
            }
        ]
    }
]

library(jsonlite)
mario <- fromJSON("http://bit.ly/mario-json")
str(mario) # nested data.frames?!? 
## 'data.frame':    2 obs. of  3 variables:
##  $ driver    : chr  "Bowser" "Peach"
##  $ occupation: chr  "Koopa" "Princess"
##  $ vehicles  :List of 2
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Wario Bike" "Piranha Prowler"
##   .. ..$ speed : int  55 40
##   .. ..$ weight: int  25 67
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Royal Racer" "Wild Wing"
##   .. ..$ speed : int  54 50
##   .. ..$ weight: int  29 34

mario$driver
## [1] "Bowser" "Peach"
mario$vehicles
## [[1]]
##             model speed weight
## 1      Wario Bike    55     25
## 2 Piranha Prowler    40     67
## 
## [[2]]
##         model speed weight
## 1 Royal Racer    54     29
## 2   Wild Wing    50     34

How do we get two tables (with a common id) like the XML example?

# this mapply statement is essentially equivalent to add_key
vehicles <- mapply(function(x, y) cbind(x, driver = y), 
                   mario$vehicles, mario$driver, SIMPLIFY = FALSE)
rbind.pages(vehicles)
##             model speed weight driver
## 1      Wario Bike    55     25 Bowser
## 2 Piranha Prowler    40     67 Bowser
## 3     Royal Racer    54     29  Peach
## 4       Wild Wing    50     34  Peach
mario[!grepl("vehicle", names(mario))]
##   driver occupation
## 1 Bowser      Koopa
## 2  Peach   Princess

Thanks! Any questions?